Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add plaintext client as alternative to SSH. #198

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

putnopvut
Copy link
Contributor

This offers an alternative to the SSH client currently used during scale
testing.

This instead connects to a server using plaintext and sends commands
this way. It is assumed that a server is running on the server on port 8000
Because ncat is a good example of such a server, the client is called
NCatClient in this change.

The reason for this new method is to eliminate the overhead of
connecting over SSH from the measurements in tests. For reasonably
large-scale tests (~300 nodes) it results in a bout a 7% reduction in
execution time.

One notable difference between this new method and using SSH is that
connections are created per host, rather than per sandbox. This allows
for clients that connect to sandboxes hosted on the same machine to
share the same connection. This has only been tested with tests that
execute serially. Tests that run scenarios in parallel may encounter
issues.

If there is no server listening, and a connection cannot be created, we
fall back to using SSH.

This offers an alternative to the SSH client currently used during scale
testing.

This instead connects to a server using plaintext and sends commands
this way. It is assumed that a server is running on the server on port 8000
Because ncat is a good example of such a server, the client is called
NCatClient in this change.

The reason for this new method is to eliminate the overhead of
connecting over SSH from the measurements in tests. For reasonably
large-scale tests (~300 nodes) it results in a bout a 7% reduction in
execution time.

One notable difference between this new method and using SSH is that
connections are created per host, rather than per sandbox. This allows
for clients that connect to sandboxes hosted on the same machine to
share the same connection. This has only been tested with tests that
execute serially. Tests that run scenarios in parallel may encounter
issues.

If there is no server listening, and a connection cannot be created, we
fall back to using SSH.
Copy link
Collaborator

@hzhou8 hzhou8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message mentioned: "One notable difference between this new method and using SSH is that connections are created per host, rather than per sandbox. ..." For my understanding the SSH connections are per host (farm node), too, not per sandbox. cc @l8huang

Otherwise LGTM. (Please rebase)

try:
global NCAT_CLIENT_CACHE
server = cred["host"]
return NCAT_CLIENT_CACHE.setdefault(server, NCatClient(server))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a option for selecting client? It's better to keep ssh client as default instead of a fallback. There could be some security concerns about using plain text client in corp network.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, I can do that.

@l8huang
Copy link
Collaborator

l8huang commented May 31, 2020

For my understanding the SSH connections are per host (farm node), too, not per sandbox.

@hzhou8 correct, SSH connections are per farm node.

@putnopvut Thanks for this performance improvement. Could you please add some documents about server-side settings for using this NCatClient?

@putnopvut
Copy link
Contributor Author

Sure, in the next revision, I'll add documentation about the server side.

My understanding was that there could be multiple farm nodes per physical server. For instance, you might have 10 farm nodes implemented as containers on a single physical server. In other words, the credential['host'] and credential['password'] of all 10 farm nodes would be the same values. Each of these farm nodes would then use a separate SSH connection, even though they're all connecting to the same physical server. With the plaintext client, I've set it up so that those 10 farm nodes will share the same connection instead. This reduces the number of connections required during the test since all farm nodes can share the same connection.

However, this has the drawback that if you are running parallel scenarios, this will cause test failures. Scenarios that are run in parallel can potentially receive messages intended for the other running scenarios.

raise NCatError(details)

def close(self):
# Test scenarios call close after every operation because with SSH,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding this is not true for SSH. We don't close connections for every operation. SSH connection is cached for every farm node, which models a physical server. For each new operation, it doesn't recreate new SSH connection, but just reusing existing connection to the node. Connections are closed when the test is done and the test program exits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I have an explanation for this :)

This patch was originally made against https://github.com/dceara/ovn-scale-test/tree/ovn-switch-per-node-devel . When I discussed with Dumitru about pushing this change, we agreed to open a PR here first and then we could port it to that particular branch after.

On that branch, we have added close() calls in a lot of places. We ran into issues where tests would fail after a while. We could no longer open new SSH connections because the servers would reject incoming connections due to having too many open. We added close() calls after basically every SSH command. My comment here makes sense in the context of how that branch operates, but here it's just wrong. I'll correct it.

By the way, if you're wondering why we are running into this problem of too many SSH connections, it likely has to do with our different understanding of farm nodes. As I said in another comment, we configure our tests to have multiple farm nodes per physical machine. So the number of SSH connections can pile up more quickly than if each farm node maps to a single physical machine.

Copy link
Collaborator

@dceara dceara Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@putnopvut sorry for the confusion with the branches in my fork. Branch ovn-switch-per-node-devel is stale. What we use downstream is ovn-switch-per-node and in that branch I removed all the close() calls because we addressed the issues we had, i.e., use runner type serial instead of a single runner that would connect to all farm nodes. Like this only the connections used in one of the iteration of the runner will be open (and will get cleaned up when the iteration ends).

I'll try to open PRs for all the stuff that's only in my fork as soon as I get the chance. I hope that will clear things out.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explain @putnopvut and @dceara

@hzhou8
Copy link
Collaborator

hzhou8 commented Jun 3, 2020

My understanding was that there could be multiple farm nodes per physical server. For instance, you might have 10 farm nodes implemented as containers on a single physical server. In other words, the credential['host'] and credential['password'] of all 10 farm nodes would be the same values. Each of these farm nodes would then use a separate SSH connection, even though they're all connecting to the same physical server. With the plaintext client, I've set it up so that those 10 farm nodes will share the same connection instead. This reduces the number of connections required during the test since all farm nodes can share the same connection.

Sorry that we didn't thought about farm node that way. A farm node models a physical server, and each physical server runs N sandboxes or containers that simulate hypervisors. So in what use case would need multiple farm nodes on a same physical server?

On the other hand, assume there is an use case, but how does ovn-scale-test tell which farm nodes are on same physical server so that they can share same connection? Can SSH use the same mechanism to share connections between multiple farm nodes on same physical server? Why would plaintext connection be different from SSH on that front?

@putnopvut
Copy link
Contributor Author

putnopvut commented Jun 3, 2020

My understanding was that there could be multiple farm nodes per physical server. For instance, you might have 10 farm nodes implemented as containers on a single physical server. In other words, the credential['host'] and credential['password'] of all 10 farm nodes would be the same values. Each of these farm nodes would then use a separate SSH connection, even though they're all connecting to the same physical server. With the plaintext client, I've set it up so that those 10 farm nodes will share the same connection instead. This reduces the number of connections required during the test since all farm nodes can share the same connection.

Sorry that we didn't thought about farm node that way. A farm node models a physical server, and each physical server runs N sandboxes or containers that simulate hypervisors. So in what use case would need multiple farm nodes on a same physical server?

In our case, we are using ovn-fake-multinode to set up containers representing the farm nodes. In our rally deployment, we specify "install_method": "physical" for our farm nodes, rather than "sandbox" or "docker", along with a "host_container" setting. To see an example of what this does, have a look at rally_ovs/plugins/ovs/ovs_clientsimpl.py::_OvnNbctl.run() . It prepends whatever command we are to run with sudo docker exec {host_container}.

On the other hand, assume there is an use case, but how does ovn-scale-test tell which farm nodes are on same physical server so that they can share same connection? Can SSH use the same mechanism to share connections between multiple farm nodes on same physical server? Why would plaintext connection be different from SSH on that front?

The ovn_multihost context gathers data from the deployment (see get_ovn_multihost_info), including ssh credentials. By referencing this context data, we can tell if multiple farm nodes have the same ssh host.

I think SSH could use this same mechanism if desired. My mindset with creating this patch was to focus on the plaintext client, not to add features to the SSH client. However, I can see how this could be frustrating to have two clients that operate differently for seemingly no reason. For my next iteration on this pull request, I'll make the plaintext client work the way SSH does now. I can make a later change to alter both SSH and plaintext to share connections between farm nodes if possible.

@hzhou8
Copy link
Collaborator

hzhou8 commented Jun 4, 2020

My understanding was that there could be multiple farm nodes per physical server. For instance, you might have 10 farm nodes implemented as containers on a single physical server. In other words, the credential['host'] and credential['password'] of all 10 farm nodes would be the same values. Each of these farm nodes would then use a separate SSH connection, even though they're all connecting to the same physical server. With the plaintext client, I've set it up so that those 10 farm nodes will share the same connection instead. This reduces the number of connections required during the test since all farm nodes can share the same connection.

Sorry that we didn't thought about farm node that way. A farm node models a physical server, and each physical server runs N sandboxes or containers that simulate hypervisors. So in what use case would need multiple farm nodes on a same physical server?

In our case, we are using ovn-fake-multinode to set up containers representing the farm nodes. In our rally deployment, we specify "install_method": "physical" for our farm nodes, rather than "sandbox" or "docker", along with a "host_container" setting. To see an example of what this does, have a look at rally_ovs/plugins/ovs/ovs_clientsimpl.py::_OvnNbctl.run() . It prepends whatever command we are to run with sudo docker exec {host_container}.

On the other hand, assume there is an use case, but how does ovn-scale-test tell which farm nodes are on same physical server so that they can share same connection? Can SSH use the same mechanism to share connections between multiple farm nodes on same physical server? Why would plaintext connection be different from SSH on that front?

The ovn_multihost context gathers data from the deployment (see get_ovn_multihost_info), including ssh credentials. By referencing this context data, we can tell if multiple farm nodes have the same ssh host.

I think SSH could use this same mechanism if desired. My mindset with creating this patch was to focus on the plaintext client, not to add features to the SSH client. However, I can see how this could be frustrating to have two clients that operate differently for seemingly no reason. For my next iteration on this pull request, I'll make the plaintext client work the way SSH does now. I can make a later change to alter both SSH and plaintext to share connections between farm nodes if possible.

Make sense! Thanks for the explanation.

@putnopvut
Copy link
Contributor Author

Sorry for the slow turnaround on this, but essentially I've been down a rabbit hole on this. When it was suggested to add configuration for plaintext vs. ssh, it got me to look into how the best way to configure it was. It looks like the deployment is the proper place to do it. And from there, I found out that I had missed some key uses of SSH in the code, and that SSH use is baked in more heavily than I had previously thought.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants